This paper describes the functioning of a broad-coverage probabilistic top-down parser, 
and its application to the problem of language modeling for speech recognition. The paper 
first introduces key notions in language modeling and probabilistic parsing, and briefly 
reviews some previous approaches to using syntactic structure for language modeling. 
A lexicalized probabilistic top-down parser is then presented, which performs very well, 
in terms of both the accuracy of returned parses and the efficiency with which they are 
found, relative to the best broad-coverage statistical parsers. A new language model which 
utilizes probabilistic top-down parsing is then outlined, and empirical results show that 
it improves upon previous work in test corpus perplexity. Interpolation with a trigram 
model yields an exceptional improvement relative to the improvement observed by other 
models, demonstrating the degree to which the information captured by our parsing model 
is orthogonal to that captured by a trigram model. A small recognition experiment also 
demonstrates the utility of the model. 

1. Introduction 

With certain exceptions, computational linguists have in the past generally formed a 
separate research community from speech recognition researchers, despite some obvious 
overlap of interest. Perhaps one reason for this is that, until relatively recently, few meth- 
ods have come out of the natural language processing community that were shown to 
improve upon the very simple language models still standardly in use in speech recog- 
nition systems. In the past few years, however, some improvements have been made 
over these language models through the use of statistical methods of natural language 
processing; and the development of innovative, linguistically well-motivated techniques 
for improving language models for speech recognition is generating more interest among 
computational linguists. While language models built around shallow local dependencies 
are still the standard in state-of-the-art speech recognition systems, there is reason to 
hope that better language models can and will be developed by computational linguists 
for this task. 

This paper will examine language modeling for speech recognition from a natural 
language processing point of view. Some of the recent literature investigating approaches 
that use syntactic structure in an attempt to capture long-distance dependencies for 
language modeling will be reviewed. A new language model, based on probabilistic top- 
down parsing, will be outlined and compared with the previous literature, and extensive 
empirical results will be presented which demonstrate its utility. 
Two features of our top-down parsing approach will emerge as key to its success. 
First, the top-down parsing algorithm builds a set of rooted candidate parse trees from 
left-to-right over the string, which allows it to calculate a generative probability for each 
prefix string from the probabilistic grammar, and hence a conditional probability for 
each word given the previous words and the probabilistic grammar. A left-to-right parser 
whose derivations are not rooted, i.e. with derivations that can consist of disconnected 
tree fragments, such as an LR or shift-reduce parser, cannot incrementally calculate the 
probability of each prefix string being generated by the probabilistic grammar, because 
its derivations include probability mass from unrooted structures. Only at the point
when its derivations become rooted (at the end of the string) can generative string
probabilities be calculated from the grammar. Such parsers can calculate word
probabilities based upon the parser state - as in Chelba and Jelinek (1998a) - but such a
distribution is not generative from the probabilistic grammar.

A parser that is not left-to-right, but which has rooted derivations, e.g. a head-first 
parser, will be able to calculate generative joint probabilities for entire strings; however 
it will not be able to calculate probabilities for each word conditioned on previously 
generated words, unless each derivation generates the words in the string in exactly the 
same order. For example, suppose that there are two possible verbs that could be the head 
of a sentence. For a head-first parser, some derivations will have the first verb as the head 
of the sentence, and the second verb will be generated after the first; hence the second 
verb's probability will be conditioned on the first verb. Other derivations will have the 
second verb as the head of the sentence, and the first verb's probability will be conditioned 
on the second verb. In such a scenario, there is no way to decompose the joint probability 
calculated from the set of derivations into the product of conditional probabilities using 
the chain rule. Of course, the joint probability can be used as a language model, but it 
cannot be interpolated on a word-by-word basis with, say, a trigram model, which we 
will demonstrate is a useful thing to do. 

Thus, our top-down parser allows for the incremental calculation of generative condi- 
tional word probabilities, a property it shares with other left-to-right parsers with rooted 
derivations such as Earley parsers (Earley, 1970) or left-corner parsers (Rosenkrantz and
Lewis II, 1970).



A second key feature of our approach is that top-down guidance improves the effi- 
ciency of the search as more and more conditioning events are extracted from the deriva- 
tion for use in the probabilistic model. Because the rooted partial derivation is fully 
connected, all of the conditioning information that might be extracted from the top- 
down left context has already been specified, and a conditional probability model built 
on this information will not impose any additional burden on the search. In contrast, an 
Earley or left-corner parser will underspecify certain connections between constituents in 
the left-context, and if some of the underspecified information is used in the conditional 
probability model, it will have to become specified. Of course, this can be done, but at 
the expense of search efficiency; the more that this is done, the less of a benefit there 
is to be had from the underspecification. A top-down parser will, in contrast, derive an 
efficiency benefit from precisely the information that is left underspecified in these other 
approaches. 

Thus, our top-down parser makes it very easy to condition the probabilistic grammar 
on an arbitrary number of values extracted from the rooted, fully specified derivation. 
This has led us to a formulation of the conditional probability model in terms of values
returned from tree-walking functions that themselves are contextually sensitive. The
top-down guidance that is provided makes this approach quite efficient in practice.

The following section will provide some background in probabilistic context-free
grammars and language modeling for speech recognition. There will also be a brief review
of previous work using syntactic information for language modeling, before introducing
our model in section 4.

Figure 1

Three parse trees: (a) a complete parse tree; (b) a complete parse tree with an explicit stop
symbol; and (c) a partial parse tree

2. Background 

2.1 Grammars and trees 

This section will introduce probabilistic (or stochastic) context-free grammars (PCFGs) [1],
as well as such notions as complete and partial parse trees, which will be important in 
defining our language model later in the paper. In addition, we will explain some simple 
grammar transformations that will be used. Finally, we will explain the notion of c- 
command, which will be used extensively later as well. 

PCFGs model the syntactic combinatorics of a language by extending conventional 
context-free grammars (CFGs). A CFG G = (V, T, P, S†) consists of a set of non-terminal
symbols V, a set of terminal symbols T, a start symbol S† ∈ V, and a set of rule
productions P of the form A → α, where α ∈ (V ∪ T)*. These context-free rules can be
interpreted as saying that a non-terminal symbol A expands into one [2] or more symbols,
each either non-terminal or terminal, α = X_0 ... X_k. A sequence of context-free rule expansions
can be represented in a tree, with parents expanding into one or more children below 
them in the tree. Each of the individual local expansions in the tree is a rule in the 
CFG. Nodes in the tree with no children are called leaves. A tree whose leaves consist 
entirely of terminal symbols is complete. Consider, for example, the parse tree shown in 
figure 1(a): the start symbol is S†, which expands into an S. The S node expands into an
NP followed by a VP. These non-terminal nodes each in turn expand, and this process 
of expansion continues until the tree generates the terminal string, "Spot chased the 
ball" , as leaves. 

A CFG G defines a language L_G, which is a subset of the set of strings of terminal
symbols, including only those that are leaves of complete trees rooted at S†, built with
rules from the grammar G. We will denote strings either as w or as w_0 w_1 ... w_n, where
w_n is understood to be the last terminal symbol in the string. For simplicity in displaying
equations, from this point forward let w_i^j be the substring w_i ... w_j. Let T_{w_0^n} be the
set of all complete trees rooted at the start symbol, with the string of terminals w_0^n as
leaves. We call T_{w_0^n} the set of complete parses of w_0^n.

1 For a detailed introduction to PCFGs, see e.g. Manning and Schütze (1999).

2 For ease of exposition, we will ignore epsilon productions for now. An epsilon production has the
empty string (ε) on the right-hand side, and can be written A → ε. Everything that is said here can
be straightforwardly extended to include such productions.

A PCFG is a CFG with a probability assigned to each rule; specifically, each right- 
hand side has a probability given the left-hand side of the rule. The probability of a 
parse tree is the product of the probabilities of each rule in the tree. Provided a PCFG 
is consistent (or tight), which it always will be in the approach we will be advocating [3],
this defines a proper probability distribution over completed trees. 

A PCFG also defines a probability distribution over strings of words (terminals) in 
the following way: 

P(w_0^n) = \sum_{t \in T_{w_0^n}} P(t)    (1)

The intuition behind equation 1 is that, if a string is generated by the PCFG, then it
will be produced if and only if one of the trees in the set T_{w_0^n} generated it. Thus the
probability of the string is the probability of the set T_{w_0^n}, i.e. the sum of its members'
probabilities.
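As a minimal sketch of equation 1, the following Python fragment computes the probability of a tree as a product of rule probabilities, and the probability of a string as a sum over its complete parses. The toy grammar (X -> X X with probability 0.4, X -> a with probability 0.6), the tuple encoding of trees, and the function names are all illustrative assumptions; they are not taken from the paper, and the start and stop symbols are omitted for brevity.

```python
from math import prod

# Hypothetical toy PCFG (not from the paper): X -> X X (0.4), X -> a (0.6).
RULE_PROBS = {
    ("X", ("X", "X")): 0.4,
    ("X", ("a",)): 0.6,
}


def tree_prob(tree, rule_probs):
    """Product of the probabilities of every rule used in the tree.

    A tree is a nested tuple (label, child, ...); a terminal is a plain string.
    """
    if isinstance(tree, str):  # terminals contribute no rule of their own
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return rule_probs[(label, rhs)] * prod(tree_prob(c, rule_probs) for c in children)


def string_prob(parses, rule_probs):
    """Equation 1: sum the probabilities of all complete parses of a string."""
    return sum(tree_prob(t, rule_probs) for t in parses)


# The two complete parses of the string "a a a" under the toy grammar.
LEFT = ("X", ("X", ("X", "a"), ("X", "a")), ("X", "a"))
RIGHT = ("X", ("X", "a"), ("X", ("X", "a"), ("X", "a")))

print(tree_prob(LEFT, RULE_PROBS))
print(string_prob([LEFT, RIGHT], RULE_PROBS))
```

Each parse uses the binary rule twice and the lexical rule three times, so each has probability 0.4^2 x 0.6^3 = 0.03456, and the string probability is their sum, 0.06912.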

Up to this point, we have been discussing strings of words without specifying whether 
they are "complete" strings or not. We will adopt the convention that an explicit begin- 
ning of string symbol, (s), and an explicit end symbol, (/s), are part of the vocabulary, 
and a string w_0^n is a complete string if and only if w_0 is (s) and w_n is (/s). Since the
beginning of string symbol is not predicted by language models, but rather is axiomatic
in the same way that S† is for a parser, we can safely omit it from the current discussion,
and simply assume that it is there. See figure 1(b) for the explicit representation.

While a complete string of words must contain the end symbol as its final word, 
a string prefix does not have this restriction. For example, "Spot chased the ball 
(/s)" is a complete string, and the following is the set of prefix strings of this com- 
plete string: "Spot"; "Spot chased"; "Spot chased the"; "Spot chased the ball"; 
and "Spot chased the ball (/s)" . A PCFG also defines a probability distribution over 
string prefixes, and we will present this in terms of partial derivations. A partial
derivation (or parse) d is defined with respect to a prefix string w_0^j as follows: it is the
leftmost derivation [4] of the string, with w_j on the right-hand side of the last expansion in
the derivation. Let D_{w_0^j} be the set of all partial derivations for a prefix string w_0^j. Then

P(w_0^j) = \sum_{d \in D_{w_0^j}} P(d)    (2)

3 A PCFG is consistent or tight if there is no probability mass reserved for infinite trees. Chi and
Geman (1998) proved that any PCFG estimated from a treebank with the relative frequency
estimator is tight. All of the PCFGs that are used in this paper are estimated using the relative
frequency estimator.

4 A leftmost derivation is a derivation in which the leftmost non-terminal is always expanded.

5 The only ε-productions that we will use in this paper are those introduced by left factorization.

Figure 2

Two parse trees: (a) a complete left-factored parse tree with epsilon productions and an
explicit stop symbol; and (b) a partial left-factored parse tree

We left-factor the PCFG, so that all productions are binary, except those with a
single terminal on the right-hand side and epsilon productions [5]. We do this because it
delays predictions about what non-terminals we expect later in the string until we have
seen more of the string. In effect, this is an underspecification of some of the predictions
that our top-down parser is making about the rest of the string. The left-factorization
transform that we use is identical to what is called right binarization in Roark and
Johnson (1999). See that paper for more discussion of the benefits of factorization for
top-down and left-corner parsing. For a grammar G, we define a factored grammar G_f
as follows:

i. (A → B A-B) ∈ G_f iff (A → Bβ) ∈ G, s.t. B ∈ V and β ∈ V*

ii. (A-α → B A-αB) ∈ G_f iff (A → αBβ) ∈ G, s.t. B ∈ V, α ∈ V+, and β ∈ V*

iii. (A-αB → ε) ∈ G_f iff (A → αB) ∈ G, s.t. B ∈ V and α ∈ V*

iv. (A → a) ∈ G_f iff (A → a) ∈ G, s.t. a ∈ T
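The four clauses above can be illustrated with a short Python sketch that factors a single rule. This is only one possible rendering, under stated assumptions: the function name `left_factor`, the tuple encoding of rules, and the use of "-" to build composite labels (as in NP-DT-NN) are choices made here for illustration, not the paper's implementation.

```python
EPSILON = "ε"  # marker for the empty right-hand side (illustrative choice)


def left_factor(lhs, rhs, terminals):
    """Return the factored rules for one rule lhs -> rhs.

    Clause iv: a rule with a single terminal on the right-hand side is unchanged.
    Clauses i and ii: each child B is emitted together with a composite label
    that records the children seen so far.
    Clause iii: the final composite label rewrites as the empty string.
    """
    if len(rhs) == 1 and rhs[0] in terminals:
        return [(lhs, (rhs[0],))]
    rules = []
    current = lhs
    for child in rhs:
        composite = f"{current}-{child}"
        rules.append((current, (child, composite)))
        current = composite
    rules.append((current, (EPSILON,)))
    return rules


TERMINALS = {"the", "ball"}
for rule in left_factor("NP", ("DT", "NN"), TERMINALS):
    print(rule)
```

Applied to NP -> DT NN, the sketch yields NP -> DT NP-DT, NP-DT -> NN NP-DT-NN, and NP-DT-NN -> ε, so every non-lexical production is binary or an epsilon production.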

We can see the effect of this transform on our example parse trees in figure 2. This
underspecification of the non-terminal predictions (e.g. VP-VBD in the example in figure 2, as
opposed to NP) allows lexical items to become part of the left-context, and so be used
to condition production probabilities, even the production probabilities of constituents
that dominate them in the unfactored tree. It also brings words further downstream into
the look-ahead at the point of specification. Note that partial trees are defined in exactly
the same way (figure 2(b)), but that the non-terminal yields are made up exclusively of
the composite non-terminals introduced by the grammar transform.

This transform has a couple of very nice properties. First, it is easily reversible, i.e.
every parse tree built with G_f corresponds to a unique parse tree built with G. Second, if
we use the relative frequency estimator for our production probabilities, the probability
of a tree built with G_f is identical to the probability of the corresponding tree built with
G.

Finally, let us introduce the term c-command. We will use this notion in our
conditional probability model, and it is also useful for understanding some of the previous
work in this area. The simple definition of c-command that we will be using in this paper
is the following: a node A c-commands a node B if and only if (i) A does not dominate [6]
B; and (ii) the lowest branching node (i.e. non-unary node) that dominates A also
dominates B. Thus in figure 1(a), the subject NP and the VP each c-command the other,
because neither dominates the other and the lowest branching node above both (the S)
dominates the other. Notice that the subject NP c-commands the object NP, but not vice
versa, since the lowest branching node that dominates the object NP is the VP, which
does not dominate the subject NP.

6 A node A dominates a node B in a tree if and only if either (i) A is the parent of B; or (ii) A is the
parent of a node C that dominates B.
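The definitions of dominance and c-command given here translate directly into code. The following Python sketch is illustrative only: the `Node` class, the function names, and the POS labels chosen for the "Spot chased the ball" tree (NNP, VBD, DT, NN) are assumptions, since the figure itself is not reproduced in the text.

```python
class Node:
    """A tree node with a label, ordered children, and a parent pointer."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def dominates(a, b):
    """True iff a is the parent of b or of some node that dominates b."""
    node = b.parent
    while node is not None:
        if node is a:
            return True
        node = node.parent
    return False


def c_commands(a, b):
    """True iff (i) a does not dominate b, and (ii) the lowest branching
    (non-unary) node that dominates a also dominates b."""
    if dominates(a, b):
        return False
    ancestor = a.parent
    while ancestor is not None and len(ancestor.children) < 2:
        ancestor = ancestor.parent
    return ancestor is not None and dominates(ancestor, b)


# Assumed shape of the complete parse of "Spot chased the ball".
subj = Node("NP", [Node("NNP", [Node("Spot")])])
obj = Node("NP", [Node("DT", [Node("the")]), Node("NN", [Node("ball")])])
vp = Node("VP", [Node("VBD", [Node("chased")]), obj])
root = Node("S†", [Node("S", [subj, vp])])

print(c_commands(subj, vp), c_commands(vp, subj))
print(c_commands(subj, obj), c_commands(obj, subj))
```

The sketch follows the simple definition literally, so it does not add further conditions (for example, that B must not dominate A) found in some other formulations of c-command.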

2.2 Language modeling for speech recognition 

This section will briefly introduce language modeling for statistical speech recognition [7].

In language modeling, we assign probabilities to strings of words. To assign a
probability, the chain rule is generally invoked. The chain rule states, for a string of k+1
words:

P(w_0^k) = P(w_0) \prod_{i=1}^{k} P(w_i | w_0^{i-1})    (3)

A Markov language model of order n truncates the conditioning information in the chain
rule to include only the previous n words.

P(w_0^k) = P(w_0) P(w_1 | w_0) ... P(w_{n-1} | w_0^{n-2}) \prod_{i=n}^{k} P(w_i | w_{i-n}^{i-1})    (4)

These models are commonly called n-gram models [8]. The standard language model used
in many speech recognition systems is the trigram model, i.e. a Markov model of order
2, which can be characterized by the following equation:

P(w_0^{n-1}) = P(w_0) P(w_1 | w_0) \prod_{i=2}^{n-1} P(w_i | w_{i-2}^{i-1})    (5)

To smooth the trigram models that are used in this paper, we interpolate the probability
estimates of higher order Markov models with lower order Markov models (Jelinek
and Mercer, 1980). The idea behind interpolation is simple and has been shown to be
very effective. For an interpolated (n+1)-gram:

P(w_i | w_{i-n}^{i-1}) = \lambda_n(w_{i-n}^{i-1}) \hat{P}(w_i | w_{i-n}^{i-1}) + (1 - \lambda_n(w_{i-n}^{i-1})) P(w_i | w_{i-n+1}^{i-1})    (6)

Here \hat{P} is the empirically observed relative frequency, and \lambda_n is a function from V^n to
[0,1]. This interpolation is recursively applied to the smaller order n-grams until the
bigram is finally interpolated with the unigram, i.e. \lambda_0 = 1.
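The recursive interpolation can be sketched in a few lines of Python. This is a simplified illustration rather than the smoothing actually used in the paper: the mixing weight is a single constant `lam` instead of a function of the history, unseen histories back off entirely to the lower order, and the padding symbols, function names, and toy corpus are assumptions introduced here.

```python
from collections import Counter


def count_ngrams(sentences, max_order=3):
    """Count every n-gram up to max_order, together with its history."""
    ngrams, histories = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] * (max_order - 1) + list(sent) + ["</s>"]
        for i in range(max_order - 1, len(words)):
            for n in range(max_order):
                hist = tuple(words[i - n:i])
                ngrams[hist + (words[i],)] += 1
                histories[hist] += 1
    return ngrams, histories


def interp_prob(word, hist, ngrams, histories, lam=0.5):
    """Interpolated estimate of P(word | hist), recursing on shorter histories."""
    hist = tuple(hist)
    seen = histories[hist]
    rel_freq = ngrams[hist + (word,)] / seen if seen else 0.0
    if not hist:
        return rel_freq  # unigram base case: the weight is 1
    weight = lam if seen else 0.0  # unseen history: rely on the lower order
    lower = interp_prob(word, hist[1:], ngrams, histories, lam)
    return weight * rel_freq + (1 - weight) * lower


SENTENCES = [["spot", "chased", "the", "ball"], ["spot", "chased", "the", "cat"]]
NGRAMS, HISTORIES = count_ngrams(SENTENCES)
print(interp_prob("ball", ("chased", "the"), NGRAMS, HISTORIES))
```

Because each level mixes two distributions that sum to one, the interpolated estimates also sum to one over the observed vocabulary for any history.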

3. Previous work 

There have been attempts to jump over adjacent words to words farther back in the
left-context, without the use of dependency links or syntactic structure, for example Saul and
Pereira (1997) and Rosenfeld (1996; 1997). We will focus our very brief review, however,
on those which use grammars or parsing for their language models. These can be divided
into two rough groups: those that use the grammar as a language model; and those that
use a parser to uncover phrasal heads standing in an important relation (c-command) to
the current word. The approach that we will subsequently present uses the probabilistic
grammar as its language model, but only includes probability mass from those parses
that are found, i.e. it uses the parser to find a subset of the total set of parses (hopefully
most of the high probability parses) and uses the sum of their probabilities as an estimate
of the true probability given the grammar.

7 For a detailed introduction to statistical speech recognition, see Jelinek (1997).

8 The n in n-gram is one more than the order of the Markov model, since the n-gram includes the
word being conditioned.

3.1 Grammar models 



As mentioned in section 2.1, a PCFG defines a probability distribution over strings of
words. One approach to syntactic language modeling is to use this distribution directly
as a language model. There are efficient algorithms in the literature (Jelinek and Lafferty,
1991; Stolcke, 1995) for calculating exact string prefix probabilities given a PCFG. The
algorithms both utilize a left-corner matrix, which can be calculated in closed form through
matrix inversion. They are limited, therefore, to grammars where the non-terminal set 
is small enough to permit inversion. String prefix probabilities can be straightforwardly 
used to compute conditional word probabilities by definition: 

P(w_{j+1} | w_0^j) = \frac{P(w_0^{j+1})}{P(w_0^j)}    (7)
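Equation 7 amounts to taking ratios of successive prefix probabilities. The small Python sketch below makes this concrete; the function name and the prefix probability values are invented for illustration and do not come from any grammar in the paper.

```python
def conditional_word_probs(prefix_probs):
    """Given prefix_probs[j] = P(w_0^j), return P(w_{j+1} | w_0^j) for each j."""
    return [nxt / cur for cur, nxt in zip(prefix_probs, prefix_probs[1:])]


# Hypothetical prefix probabilities for a three-word prefix.
PREFIX_PROBS = [0.5, 0.2, 0.05]
print(conditional_word_probs(PREFIX_PROBS))
```

Multiplying the first prefix probability by the conditional probabilities recovers the last prefix probability, which is the chain rule read in the other direction.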



Stolcke and Segal (1994) and Jurafsky et al. (1995) used these basic ideas to estimate
bigram probabilities from hand-written PCFGs, which were then used in language
models. Interpolating the observed bigram probabilities with these calculated bigrams 
led, in both cases, to improvements in word error rate over using the observed bigrams 
alone, demonstrating that there is some benefit to using these syntactic language models 
to generalize beyond observed n-grams. 

3.2 Finding phrasal heads 

Another approach that uses syntactic structure for language modeling has been to use 
a shift-reduce parser to "surface" c-commanding phrasal head words or part-of-speech 
(POS) tags from arbitrarily far back in the prefix string, for use in a trigram-like model. 

A shift-reduce parser [9] operates from left-to-right using a stack and a pointer to the
next word in the input string. Each stack entry consists minimally of a non-terminal 
label. The parser performs two basic operations: (i) shifting, which involves pushing the 
POS label of the next word onto the stack and moving the pointer to the following word 
in the input string; and (ii) reducing, which takes the top k stack entries and replaces 
them with a single new entry, the non-terminal label of which is the left-hand side of a 
rule in the grammar which has the k top stack entry labels on the right-hand side. For 
example, if there is a rule NP → DT NN, and the top two stack entries are NN and DT,
then those two entries can be popped off of the stack and an entry with the label NP 
pushed onto the stack. 
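The two basic operations can be sketched as follows. This is a bare illustration of shifting and reducing on a stack of labels, assuming a Python list whose last element is the top of the stack; the function names are hypothetical, and no search strategy or probability model is included.

```python
def shift(stack, pos_tags, pointer):
    """Push the POS label of the next word and advance the input pointer."""
    stack.append(pos_tags[pointer])
    return pointer + 1


def reduce_top(stack, lhs, rhs):
    """Replace the top k stack entries with lhs if they match rhs, where
    k = len(rhs). Returns True if the reduction was applied."""
    k = len(rhs)
    if k == 0 or tuple(stack[-k:]) != tuple(rhs):
        return False
    del stack[-k:]
    stack.append(lhs)
    return True


# Parse the POS sequence DT NN with the single rule NP -> DT NN.
stack, pointer = [], 0
pointer = shift(stack, ["DT", "NN"], pointer)
pointer = shift(stack, ["DT", "NN"], pointer)
applied = reduce_top(stack, "NP", ("DT", "NN"))
print(stack, pointer, applied)
```

After the two shifts the stack holds DT below NN, so the rule NP -> DT NN matches the top two entries and they are replaced by a single NP entry.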

Goddeau (1992) used a robust deterministic shift-reduce parser to condition word 
probabilities by extracting a specified number of stack entries from the top of the current 
state, and conditioning on those entries in a way similar to an n-gram. In empirical trials, 
Goddeau used the top 2 stack entries to condition the word probability. He was able to 
reduce both sentence and word error rates on the ATIS corpus using this method. 



The "Structured Language Model" (SLM) used in Chelba and Jelinek (1998a; 1998b;
1999), Jelinek and Chelba (1999), and Chelba (2000) is similar to that of Goddeau, except
that (i) their shift-reduce parser follows a non-deterministic beam search, and (ii) each
stack entry contains, in addition to the non-terminal node label, the head-word of the
constituent. The SLM is like a trigram, except that the conditioning words are taken
from the tops of the stacks of candidate parses in the beam, rather than from the linear
order of the string.

9 For details, see e.g. Hopcroft and Ullman (1979).

Figure 3

Tree representation of a derivation state

Their parser functions in three stages. The first stage assigns a probability to the 
word given the left-context (represented by the stack state). The second stage predicts 
the POS given the word and the left-context. The last stage performs all possible parser 
operations (reducing stack entries and shifting the new word). When there is no more 
parser work to be done (or, in their case, when the beam is full), the following word is 
predicted. And so on until the end of the string. 

Each different POS assignment or parser operation is a step in a derivation. Each 
distinct derivation path within the beam has a probability and a stack state associated 
with it. Every stack entry has a non-terminal node label and a designated head word of 
the constituent. When all of the parser operations have finished at a particular point in 
the string, the next word is predicted as follows. For each derivation in the beam, the 
head words of the two topmost stack entries form a trigram with the conditioned word. 
This interpolated trigram probability is then multiplied by the normalized probability of 
the derivation, to provide that derivation's contribution to the probability of the word. 
More precisely, for a beam of derivations D_i:

P(w_{i+1} | w_0^i) = \frac{\sum_{d \in D_i} P(w_{i+1} | h_{0d} h_{1d}) P(d)}{\sum_{d \in D_i} P(d)}    (8)

where h_{0d} and h_{1d} are the lexical heads of the top two entries on the stack of d.
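The weighting of each derivation's trigram estimate by its normalized probability can be sketched as below. The beam contents, the probability table, and the function names are made up for illustration; in particular, `toy_trigram` stands in for whatever interpolated trigram estimate is conditioned on the two exposed head words.

```python
def slm_word_prob(word, beam, trigram_prob):
    """Average trigram_prob(word, h0, h1) over the beam, weighting each
    derivation by its probability normalized over the beam.

    beam is a list of (h0, h1, p) triples: the head words of the top two
    stack entries of a derivation, and that derivation's probability.
    """
    total = sum(p for _, _, p in beam)
    weighted = sum(trigram_prob(word, h0, h1) * p for h0, h1, p in beam)
    return weighted / total


# Hypothetical head-word trigram table and two-derivation beam.
TABLE = {("with", "cat", "chased"): 0.2, ("with", "chased", "dog"): 0.1}


def toy_trigram(word, h0, h1):
    return TABLE.get((word, h0, h1), 0.0)


BEAM = [("cat", "chased", 0.03), ("chased", "dog", 0.01)]
print(slm_word_prob("with", BEAM, toy_trigram))
```

With these numbers the first derivation carries three quarters of the beam's probability mass, so the result, (0.2 x 0.03 + 0.1 x 0.01) / 0.04 = 0.175, lies closer to its estimate of 0.2.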

Figure 3 gives a partial tree representation of a potential derivation state for the
string "the dog chased the cat with spots", at the point when the word "with" is 
to be predicted. The shift-reduce parser will have, perhaps, built the structure shown, and 
the stack state will have an NP entry with the head "cat" at the top of the stack, and a 
VBD entry with the head "chased" second on the stack. In the Chelba and Jelinek model, 
the probability of "with" is conditioned on these two head words, for this derivation. 

Since the specific results of the SLM will be compared in detail with our model when 
the empirical results are presented, at this point we will simply state that they have 
achieved a reduction in both perplexity and WER over a standard trigram using this 
model. 

The rest of this paper will present our parsing model, its application to language 
modeling for speech recognition, and empirical results. 
4. Top-down parsing and language modeling 



Statistically-based heuristic best-first or beam-search strategies (Caraballo and
Charniak, 1998; Charniak, Goldwater, and Johnson, 1998; Goodman, 1997) have yielded an
enormous improvement in the quality and speed of parsers, even without any guarantee
that the parse returned is, in fact, that with the maximum likelihood for the probability
model. The parsers with the highest published broad-coverage parsing accuracy, which
include Charniak (1997; 2000), Collins (1997; 1999), and Ratnaparkhi (1997), all utilize
simple and straightforward statistically-based search heuristics, pruning the search space
quite dramatically [10]. Such methods are nearly always used in conjunction with some form
of dynamic programming (henceforth DP). That is, search efficiency for these parsers is 
improved by both statistical search heuristics and DP. Here we will present a parser that 
uses simple search heuristics of this sort without DP. Our approach is found to yield very 
accurate parses efficiently, and, in addition, to lend itself straightforwardly to estimating 
word probabilities on-line, i.e. in a single pass from left-to-right. This on-line character- 
istic allows our language model to be interpolated on a word-by-word basis with other 
models, such as the trigram, yielding further improvements. 

Next we will outline our conditional probability model over rules in the PCFG, fol- 
lowed by a presentation of the top-down parsing algorithm. We will then present empirical 
results in two domains: one to compare with previous work in the parsing literature, and 
the other to compare with previous work using parsing for language modeling for speech 
recognition, in particular with the Chelba and Jelinek results mentioned above. 



4.1 Conditional probability model 

A simple PCFG conditions rule probabilities on the left-hand side of the rule. It has been
shown repeatedly - e.g. Briscoe and Carroll (1993), Charniak (1997), Collins (1997), Inui
et al. (1997), Johnson (1998) - that conditioning the probabilities of structures on the
context within which they appear, for example on the lexical head of a constituent
(Charniak, 1997; Collins, 1997), on the label of its parent non-terminal (Johnson, 1998),
or, ideally, on both and many other things besides, leads to a much better parsing model
and results in higher parsing accuracies.

One way of thinking about conditioning the probabilities of productions on contex- 
tual information, e.g. the label of the parent of a constituent or the lexical heads of 
constituents, is as annotating the extra conditioning information onto the labels in the 
context-free rules. Examples of this are bilexical grammars - see e.g. Eisner and Satta 
(1999), Charniak (1997), Collins (1997) - where the lexical heads of each constituent are
annotated on both the right- and left-hand sides of the context-free rules, under the
constraint that every constituent inherits the lexical head from exactly one of its children,
and the lexical head of a POS is its terminal item. Thus the rule S → NP VP becomes, for
instance, S[barks] → NP[dog] VP[barks]. One way to estimate the probabilities of these
rules is to annotate the heads onto the constituent labels in the training corpus, and
simply count the number of times particular productions occur (relative frequency
estimation). This procedure yields conditional probability distributions of constituents on the
right-hand side with their lexical heads, given the left-hand side constituent and its
lexical head. The same procedure works if we annotate parent information onto constituents.



10 Johnson et al. (1999), Henderson and Brill (1999), and Collins (2000) demonstrate methods for
choosing the best complete parse tree from among a set of complete parse trees, and the latter two
show accuracy improvements over some of the parsers cited above, from which they generated their
candidate sets. Here I will be comparing my work with parsing algorithms, i.e. algorithms which
build parses for strings of words.
This is how Johnson (1998) conditioned the probabilities of productions: the left-hand 
side is no longer, for example, S, but rather S^SBAR, i.e. an S with SBAR as parent. 
Notice, however, that in this case the annotations on the right-hand side are predictable 
from the annotation on the left-hand side (unlike, for example, bilexical grammars), so 
that the relative frequency estimator yields conditional probability distributions of the 
original rules, given the parent of the left-hand side. 

All of the conditioning information that we will be considering will be of this latter 
sort: the only novel predictions being made by rule expansions are the node-labels of 
the constituents on the right-hand side. Everything else is already specified by the left- 
context. We use the relative frequency estimator, and smooth our production probabilities 
by interpolating the relative frequency estimates with those obtained by "annotating" 
less contextual information. 

This perspective on conditioning production probabilities makes it easy to see that, 
in essence, by conditioning these probabilities, we are growing the state space. That is, 
the number of distinct non-terminals grows to include the composite labels; so does the 
number of distinct productions in the grammar. In a top-down parser, each rule expansion 
is made for a particular candidate parse, which carries with it the entire rooted derivation 
to that point; in a sense, the left-hand side of the rule is annotated with the entire left- 
context, and the rule probabilities can be conditioned on any aspect of this derivation. 

We do not use the entire left-context to condition the rule probabilities, but rather 
"pick-and-choose" which events in the left-context we would like to condition on. One 
can think of the conditioning events as functions, which take the partial tree structure 
as an argument and return a value, upon which the rule probability can be conditioned. 
Each of these functions is an algorithm for walking the provided tree and returning a 
value. For example, suppose that we want to condition the probability of the rule A → α.
We might write a function that takes the partial tree, finds the parent of the left-hand 
side of the rule and returns its node label. If the left-hand side has no parent, i.e. it is at 
the root of the tree, the function returns the null value (NULL). We might write another 
function that returns the non-terminal label of the closest sibling to the left of A, and 
NULL if no such node exists. We can then condition the probability of the production 
on the values that were returned by the set of functions. 
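The two example functions just described might be sketched as follows. This is an illustration only, assuming a simple `Node` class with parent pointers and using Python's `None` for the NULL value; it ignores the factored grammar, and the names `parent_label`, `left_sibling_label`, and `conditioning_values` are hypothetical rather than taken from the parser itself.

```python
class Node:
    """A node in a (possibly partial) parse tree."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def parent_label(node):
    """Label of the parent of the node, or None (NULL) at the root."""
    return node.parent.label if node.parent is not None else None


def left_sibling_label(node):
    """Label of the closest sibling to the left, or None (NULL) if absent."""
    if node.parent is None:
        return None
    siblings = node.parent.children
    index = next(i for i, s in enumerate(siblings) if s is node)
    return siblings[index - 1].label if index > 0 else None


def conditioning_values(node, functions=(parent_label, left_sibling_label)):
    """Apply each tree-walking function to the node about to be expanded."""
    return tuple(f(node) for f in functions)


# Partial tree: the object NP under VP has not been expanded yet.
vbd = Node("VBD")
obj = Node("NP")
vp = Node("VP", [vbd, obj])
s = Node("S", [Node("NP"), vp])
print(conditioning_values(obj))
```

The tuple returned for a node can then serve as the conditioning context when looking up the probability of a rule that expands that node.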

Recall that we are working with a factored grammar, so some of the nodes in the 
factored tree have non-terminal labels that were created by the factorization, and may 
not be precisely what we want for conditioning purposes. In order to avoid any confusion
in identifying the non-terminal label of a particular rule production in either its
factored or non-factored version, we introduce the function constituent(A) for every
non-terminal in the factored grammar G_f, which is simply the label of the constituent
whose factorization results in A. For example, in figure 2, constituent(NP-DT-NN) is
simply NP.

Note that a function can return different values depending upon the location in the 
tree of the non-terminal that is being expanded. For example, suppose that we have 
a function that returns the label of the closest sibling to the left of constituent (A) 
or NULL if no such node exists. Then a subsequent function could be defined as follows:
return the parent of the parent (the grandparent) of constituent(A) only if
constituent (A) has no sibling to the left - in other words, if the previous function 
returns NULL; otherwise return the 2nd closest sibling to the left of constituent (A) , 
or, as always, NULL if no such node exists. If the function returns, for example, "NP" , 
this could either mean that the grandparent is NP or the 2nd closest sibling is NP; yet 
there is no ambiguity in the meaning of the function, since the result of the previous 
function disambiguates between the two possibilities. 
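The conditioning functions described above can be sketched as small tree-walking routines. The following is a minimal illustration only, not the implementation used in the study: the `Node` class, the function names, and the toy tree are all assumptions introduced here, and the tree is taken to be the non-factored one, so every label is already a constituent label.

```python
class Node:
    """A node of the non-factored partial parse tree (hypothetical helper type)."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def left_siblings(node):
    """Siblings to the left of `node`, closest first."""
    if node.parent is None:
        return []
    kids = node.parent.children
    idx = next(i for i, k in enumerate(kids) if k is node)
    return kids[:idx][::-1]


def parent_label(node):
    """Label of the parent, or None (the NULL value) at the root."""
    return node.parent.label if node.parent is not None else None


def closest_left_sibling(node):
    """Label of the closest sibling to the left, or None if there is none."""
    sibs = left_siblings(node)
    return sibs[0].label if sibs else None


def grandparent_or_second_sibling(node):
    """Grandparent label if there is no left sibling; else the 2nd closest left sibling."""
    sibs = left_siblings(node)
    if not sibs:
        grandparent = node.parent.parent if node.parent is not None else None
        return grandparent.label if grandparent is not None else None
    return sibs[1].label if len(sibs) > 1 else None


# Partial tree: S -> NP VP, VP -> VBD NP, with the object NP about to be expanded.
root = Node("S")
subj = Node("NP", root)
vp = Node("VP", root)
vbd = Node("VBD", vp)
obj = Node("NP", vp)

context = (parent_label(obj), closest_left_sibling(obj), grandparent_or_second_sibling(obj))
```

For the object NP the third function looks for a second left sibling and finds none, whereas for VBD (which has no left sibling) it returns the grandparent label. The value returned by `closest_left_sibling` tells the model which of the two readings applies.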

The functions that were used for the present study to condition the probability of
the rule, $A \rightarrow \alpha$, are presented in figure 4, in a tree structure. This is a sort of decision
tree for a tree-walking algorithm to decide what value to return, for a given partial tree
and a given depth. For example, if the algorithm is asked for the value at level 0, it will
return A, the left-hand side of the rule being expanded (see footnote 11). Suppose the algorithm is asked
for the value at level 4. After level 2 there is a branch in the decision tree. If the left-hand
side of the rule is a POS, and there is no sibling to the left of constituent(A) in the
derivation, then the algorithm takes the right branch of the decision tree to decide what
value to return; otherwise the left branch. Suppose it takes the left branch. Then after
level 3, there is another branch in the decision tree. If the left-hand side of the production
is a POS, then the algorithm takes the right branch of the decision tree, and returns (at
level 4) the POS of the closest c-commanding lexical head to A, which it finds by walking
the parse tree; if the left-hand side of the rule is not a POS, then the algorithm returns
(at level 4) the closest sibling to the left of the parent of constituent(A).

[Figure 4: decision tree over conditioning functions. Node labels, in the order they appear in the figure:]

For all rules $A \rightarrow \alpha$
  - A
  - the parent, $Y_p$, of constituent(A) in the derivation
  - the closest sibling, $Y_s$, to the left of constituent(A) in the derivation
  - if $Y_s$ is CC, the leftmost child of the conjoining category; else NULL
  - the closest c-commanding lexical head to A
  - the lexical head of constituent(A) if already seen; otherwise the lexical head of the
    closest constituent to the left of A within constituent(A)
  - the next closest c-commanding lexical head to A

Figure 4

Conditional probability model represented as a decision tree, identifying the location in the
partial parse tree of the conditioning information

The functions that we have chosen for this paper follow from the intuition (and
experience) that what helps parsing is different depending on the constituent that is
being expanded. POS nodes have lexical items on the right-hand side, and hence can
bring some of the head-head dependencies into the model that have been shown to be so
effective. If the POS is leftmost within its constituent, then very often the lexical item
is sensitive to the governing category to which it is attaching. For example, if the POS
is a preposition, then its probability of expanding to a particular word is very different
if it is attaching to a noun phrase versus a verb phrase, and perhaps quite different
depending on the head of the constituent to which it is attaching. Subsequent POSs
within a constituent are likely to be open class words, and less dependent on these sorts
of attachment preferences.

11 Recall that A can be a composite non-terminal introduced by grammar factorization. When the
function is defined in terms of constituent(A), the values returned are obtained by moving
through the non-factored tree.

Conditioning   Mnemonic label   Information level
0,0,0          none             Simple PCFG
2,2,2          par+sib          Small amount of structural context
5,2,2          NT struct        All structural (non-lexical) context for non-POS
6,2,2          NT head          Everything for non-POS expansions
6,3,2          POS struct       More structural info for leftmost POS expansions
6,5,2          attach           All attachment info for leftmost POS expansions
6,6,4          all              Everything

Table 1

Levels of conditioning information, mnemonic labels, and a brief description of the information
level for empirical results

Conditioning on parents and siblings of the left-hand side has proven to be very 
useful. To understand why this is the case, one need merely to think of VP expansions. 
If the parent of a VP is another VP (i.e. if an auxiliary or modal verb is used), then the 
distribution over productions is different than if the parent is an S. Conditioning on head 
information, both POS of the head and the lexical item itself, has proven useful as well, 
although given our parser's left-to-right orientation, in many cases the head has not been 
encountered within the particular constituent. In such a case, the head of the last child 
within the constituent is used as a proxy for the constituent head. All of our conditioning 
functions, with one exception, return either parent or sibling node labels at some specific 
distance from the left-hand side, or head information from c-commanding constituents. 
The exception is the function at level 5 along the left branch of the tree in figure 4.
Suppose that the node being expanded is being conjoined with another node, which we 
can tell by the presence or absence of a CC node. In that case, we want to condition the 
expansion on how the conjoining constituent expanded. In other words, this attempts to 
capture a certain amount of parallelism between the expansions of conjoined categories. 

In presenting the parsing results, we will systematically vary the amount of conditioning
information, so as to get an idea of the behavior of the parser. We will refer to the
amount of conditioning by specifying the deepest level from which a value is returned for
each branching path in the decision tree, from left to right in figure 4: the first number
is for left-contexts where the left branch of the decision tree is always followed (non-POS
non-terminals on the left-hand side); the second number for a left branch followed
by a right branch (POS nodes that are leftmost within their constituent); and the third
number for the contexts where the right branch is always followed (POS nodes that are
not leftmost within their constituent). For example, (4,3,2) would represent a conditional
probability model that (i) returns NULL for all functions below level four in all contexts;
(ii) returns NULL for all functions below level three if the left-hand side is a POS; and
(iii) returns NULL for all functions below level two for non-leftmost POS expansions.



Table 1 gives a breakdown of the different levels of conditioning information used in
the empirical trials, with a mnemonic label that will be used when presenting results.
These different levels were chosen as somewhat natural points at which to observe how
much of an effect increasing the conditioning information has. We first include structural
information from the context, i.e. node labels from constituents in the left context.
Then we add lexical information, first for non-POS expansions, then for leftmost POS
expansions, then for all expansions.

All of the conditional probabilities are linearly interpolated. For example, the probability
of a rule conditioned on six events is the linear interpolation of two probabilities:
(i) the empirically observed relative frequency of the rule when the six events co-occur; 
and (ii) the probability of the rule conditioned on the first five events (which is in turn 
interpolated). The interpolation coefficients are a function of the frequency of the set of 
conditioning events, and are estimated by iteratively adjusting the coefficients so as to 
maximize the likelihood of a held out corpus. 
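The recursive interpolation just described can be sketched as follows. This is an illustrative approximation rather than the estimator used in the study: the function names, the prefix-count layout, and the stand-in weight function `lam` are assumptions made here, and the held-out estimation of the coefficients is not shown.

```python
from collections import Counter


def count_events(observations):
    """Count each rule under every prefix of its ordered conditioning events."""
    rule_counts, context_counts = Counter(), Counter()
    for rule, context in observations:
        for k in range(len(context) + 1):
            prefix = tuple(context[:k])
            rule_counts[(rule, prefix)] += 1
            context_counts[prefix] += 1
    return rule_counts, context_counts


def interpolated_prob(rule, context, rule_counts, context_counts, lam):
    """P(rule | context), backing off by dropping the last conditioning event."""
    context = tuple(context)
    n_ctx = context_counts[context]
    rel_freq = rule_counts[(rule, context)] / n_ctx if n_ctx else 0.0
    if not context:
        # Base case (an assumption of this sketch): unconditioned relative frequency.
        return rel_freq
    weight = lam(n_ctx)  # the coefficient depends on the frequency of the context
    backoff = interpolated_prob(rule, context[:-1], rule_counts, context_counts, lam)
    return weight * rel_freq + (1.0 - weight) * backoff


# Toy data: (rule, (left-hand side, parent label)).
observations = [
    ("VP->V NP", ("VP", "S")),
    ("VP->V NP", ("VP", "S")),
    ("VP->V", ("VP", "VP")),
    ("VP->V NP", ("VP", "VP")),
]
rule_counts, context_counts = count_events(observations)
lam = lambda n: n / (n + 1.0)  # hypothetical stand-in for the estimated coefficients

p_v = interpolated_prob("VP->V", ("VP", "S"), rule_counts, context_counts, lam)
p_v_np = interpolated_prob("VP->V NP", ("VP", "S"), rule_counts, context_counts, lam)
```

Here "VP->V" was never observed with parent S, yet it still receives probability mass from the less specific estimate, and the two rule probabilities sum to one.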

This was an outline of the conditional probability model that we used for the PCFG. 
The model allows us to assign probabilities to derivations, which can be used by the 
parsing algorithm to decide heuristically which candidates are promising and should 
be expanded, and which are less promising and should be pruned. We now outline the 
top-down parsing algorithm. 



4.2 Top-down Probabilistic Parsing 

This parser is essentially a stochastic version of the top-down parser described in Aho,
Sethi, and Ullman (1986). It uses a PCFG with a conditional probability model of the
sort defined in the previous section. We will first define candidate analysis (i.e. a partial
parse), and then a derives relation between candidate analyses. We will then present the
algorithm in terms of this relation.

The parser takes an input string $w_0^n$, a PCFG $G$, and a priority queue of candidate
analyses. A candidate analysis $C = (D, S, P_D, F, w_i^n)$ consists of a derivation $D$, a stack
$S$, a derivation probability $P_D$, a figure-of-merit $F$, and a string $w_i^n$ remaining to be
parsed. The first word in the string remaining to be parsed, $w_i$, we will call the look-ahead
word. The derivation $D$ consists of a sequence of rules used from $G$. The stack $S$
contains a sequence of non-terminal symbols, and an end-of-stack marker \$ at the bottom.
The probability $P_D$ is the product of the probabilities of all rules in the derivation $D$.
$F$ is the product of $P_D$ and a look-ahead probability, $\mathrm{LAP}(S, w_i)$, which is a measure of
the likelihood of the stack $S$ rewriting with $w_i$ at its left corner.

We can define a derives relation, denoted $\Rightarrow$, between two candidate analyses as
follows. $(D, S, P_D, F, w_i^n) \Rightarrow (D', S', P_{D'}, F', w_j^n)$ if and only if (see footnote 12)

i. $D' = D + A \rightarrow X_0 \ldots X_k$;

ii. $S = A\alpha\$$;

iii. either $S' = X_0 \ldots X_k\alpha\$$ and $j = i$,
or $k = 0$, $X_0 = w_i$, $j = i + 1$, and $S' = \alpha\$$;

iv. $P_{D'} = P_D P(A \rightarrow X_0 \ldots X_k)$; and

v. $F' = P_{D'} \mathrm{LAP}(S', w_j)$

The parse begins with a single candidate analysis on the priority queue: $((), S^\dagger\$, 1, 1, w_0^n)$.
It then proceeds as follows. The top ranked candidate analysis, $C = (D, S, P_D, F, w_i^n)$, is
popped from the priority queue. If $S = \$$ and $w_i = \langle/s\rangle$, then the analysis is complete.
Otherwise, all $C'$ such that $C \Rightarrow C'$ are pushed onto the priority queue.
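The pop-and-expand loop described above can be illustrated with a much simplified best-first parser. This sketch makes several assumptions that are not part of the model in the text: the toy grammar and the name `parse` are invented here, the figure-of-merit is taken to be the derivation probability alone (no look-ahead probability), a single queue is used with no beam, and an analysis counts as complete when the stack holds only the end marker and all words have been consumed.

```python
import heapq
from itertools import count

# Toy PCFG (hypothetical): non-terminal -> list of (right-hand side, probability).
GRAMMAR = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("DT", "NN"), 0.6), (("NN",), 0.4)],
    "VP": [(("VBD", "NP"), 0.7), (("VBD",), 0.3)],
    "DT": [(("the",), 1.0)],
    "NN": [(("dog",), 0.5), (("ball",), 0.5)],
    "VBD": [(("chased",), 1.0)],
}


def parse(words, grammar=GRAMMAR, start="S", max_pops=10_000):
    """Return (derivation, probability) of the best parse, or None if none is found."""
    tie = count()  # tie-breaker so the heap never has to compare derivations
    # Heap entries: (-F, tie, D, S, P_D, i); here F is simply P_D.
    heap = [(-1.0, next(tie), (), (start, "$"), 1.0, 0)]
    for _step in range(max_pops):
        if not heap:
            return None
        _neg_f, _tie, deriv, stack, p_d, i = heapq.heappop(heap)
        if stack == ("$",):
            if i == len(words):
                return deriv, p_d
            continue
        if i >= len(words):
            continue  # input exhausted but the stack is not empty
        top, rest = stack[0], stack[1:]
        for rhs, p in grammar.get(top, ()):
            lexical = len(rhs) == 1 and rhs[0] not in grammar
            if lexical:
                if rhs[0] != words[i]:
                    continue  # the terminal must match the look-ahead word
                new_stack, j = rest, i + 1
            else:
                new_stack, j = rhs + rest, i
            new_p = p_d * p
            new_deriv = deriv + ((top, rhs),)
            heapq.heappush(heap, (-new_p, next(tie), new_deriv, new_stack, new_p, j))
    return None


result = parse("the dog chased the ball".split())
```

Because rule probabilities never exceed one, the first complete analysis popped in this simplified setting is the most probable one under the toy grammar. A left-recursive grammar would not terminate without pruning, which is why `max_pops` bounds the loop.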

We implement this as a beam search. For each word position $i$, we have a separate
priority queue, $H_i$, of analyses with look-ahead $w_i$. When there are "enough" analyses by
some criteria (which we will discuss below) on priority queue $H_{i+1}$, all candidate analyses
remaining on $H_i$ are discarded. Since $w_n = \langle/s\rangle$, all parses that are pushed onto $H_{n+1}$
are complete. The parse on $H_{n+1}$ with the highest probability is returned for evaluation.
In the case that no complete parse is found, a partial parse is returned and evaluated.

12 Again, for ease of exposition, we will ignore $\epsilon$-productions. Everything presented here can be
straightforwardly extended to include them. The + in (i) denotes concatenation. To avoid confusion
between sets and sequences, $\emptyset$ will not be used for empty strings or sequences, rather the symbol ()
will be used. Note that $S$ is used to denote stacks, while $S^\dagger$ is the start symbol.

The LAP is the probability of a particular terminal being the next left-corner of a
particular analysis. The terminal may be the left-corner of the top-most non-terminal on
the stack of the analysis or it might be the left-corner of the $n$th non-terminal, after the
top $n-1$ non-terminals have rewritten to $\epsilon$. Of course, we cannot expect to have adequate
statistics for each non-terminal/word pair that we encounter, so we smooth to the POS.
Since we do not know the POS for the word, we must sum the LAP for all POS labels (see footnote 13).

For a PCFG $G$, a stack $S = A_0 \ldots A_n\$$ (which we will write $A_0^n\$$) and a look-ahead
terminal item $w_i$, we define the look-ahead probability as follows:

$$\mathrm{LAP}(S, w_i) = \sum_{\alpha \in (V \cup T)^*} P_G(A_0^n \overset{*}{\rightarrow} w_i\alpha) \tag{9}$$

We recursively estimate this with two empirically observed conditional probabilities for
every non-terminal $A_i$: $P(A_i \overset{*}{\rightarrow} w_i\alpha)$ and $P(A_i \overset{*}{\rightarrow} \epsilon)$. The same empirical probability,
$P(A_i \overset{*}{\rightarrow} X\alpha)$, is collected for every pre-terminal $X$ as well. The LAP approximation for
a given stack state and look-ahead terminal is:

$$P_G(A_j^n \overset{*}{\rightarrow} w_i\alpha) \approx P_G(A_j \overset{*}{\rightarrow} w_i\alpha) + P(A_j \overset{*}{\rightarrow} \epsilon)\, P_G(A_{j+1}^n \overset{*}{\rightarrow} w_i\alpha) \tag{10}$$

where

$$P_G(A_j \overset{*}{\rightarrow} w_i\alpha) \approx \lambda_{A_j} P(A_j \overset{*}{\rightarrow} w_i\alpha) + (1 - \lambda_{A_j}) \sum_{X \in V} P(A_j \overset{*}{\rightarrow} X\alpha)\, P(X \rightarrow w_i) \tag{11}$$

The lambdas are a function of the frequency of the non-terminal $A_j$, in the standard way
(Jelinek and Mercer, 1980).
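The recursion in equation 10 and the smoothing in equation 11 can be sketched as follows. This is a rough illustration under stated assumptions: the dictionary layouts, the function names, and the fixed weight `lam` are hypothetical (in the text the lambdas depend on the frequency of the non-terminal), and the toy probabilities are invented.

```python
def smoothed_left_corner(nt, word, lc_word, lc_pos, emit, lam):
    """Equation 11: mix the direct word left-corner estimate with one that goes via POS."""
    direct = lc_word.get((nt, word), 0.0)
    via_pos = sum(
        p * emit.get((pos, word), 0.0)
        for (a, pos), p in lc_pos.items()
        if a == nt
    )
    return lam * direct + (1.0 - lam) * via_pos


def lap(stack, word, left_corner, p_empty):
    """Equation 10, unrolled: walk down the stack while the symbols above can be empty."""
    total, reach = 0.0, 1.0  # reach = probability that everything above rewrote to epsilon
    for nt in stack:
        if nt == "$" or reach == 0.0:
            break
        total += reach * left_corner(nt, word)
        reach *= p_empty.get(nt, 0.0)
    return total


# Invented toy statistics.
lc_word = {("NP", "the"): 0.3, ("X", "the"): 0.1}
lc_pos = {("NP", "DT"): 0.5, ("NP", "NN"): 0.4}
emit = {("DT", "the"): 0.6}
p_empty = {"X": 0.2}  # only the composite symbol X can rewrite to epsilon here


def left_corner(nt, word):
    return smoothed_left_corner(nt, word, lc_word, lc_pos, emit, lam=0.5)


np_the = left_corner("NP", "the")
lap_value = lap(("X", "NP", "$"), "the", left_corner, p_empty)
```

The loop form is equivalent to the recursion: each stack symbol contributes its own left-corner estimate, weighted by the probability that every symbol above it rewrote to the empty string.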



The beam threshold at word $w_i$ is a function of the probability of the top ranked
candidate analysis on priority queue $H_{i+1}$ and the number of candidates on $H_{i+1}$. The
basic idea is that we want the beam to be very wide if there are few analyses that have
been advanced, but relatively narrow if many analyses have been advanced. If $p$ is the
probability of the highest ranked analysis on $H_{i+1}$, then another analysis is discarded if
its probability falls below $p f(\gamma, |H_{i+1}|)$, where $\gamma$ is an initial parameter, which we call
the base beam factor. For the current study, $\gamma$ was $10^{-11}$, unless otherwise noted, and
$f(\gamma, |H_{i+1}|) = \gamma |H_{i+1}|^3$. Thus, if 100 analyses have already been pushed onto $H_{i+1}$, then
a candidate analysis must have a probability above $10^{-5} p$ to avoid being pruned. After
1000 candidates, the beam has narrowed to $10^{-2} p$. There is also a maximum number of
allowed analyses on $H_i$, in case the parse fails to advance an analysis to $H_{i+1}$. This was
typically 10,000.
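The narrowing beam can be checked with a few lines of arithmetic. The function names below are hypothetical; the formula is the one given above, with the base beam factor multiplied by the cube of the number of analyses already advanced.

```python
def beam_threshold(best_prob, num_advanced, base_beam_factor=1e-11):
    """Probability below which a candidate is discarded: p * gamma * |H_{i+1}|^3."""
    return best_prob * base_beam_factor * num_advanced ** 3


def survives(prob, best_prob, num_advanced, base_beam_factor=1e-11):
    """True if a candidate with probability `prob` stays inside the beam."""
    return prob >= beam_threshold(best_prob, num_advanced, base_beam_factor)


# Thresholds relative to a best analysis of probability 1.0.
after_100 = beam_threshold(1.0, 100)
after_1000 = beam_threshold(1.0, 1000)
```

With 100 analyses advanced the threshold is about 1e-5 times the best probability, and with 1000 it is about 1e-2, matching the figures in the text.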



As mentioned in section 2.1, we left-factor the grammar, so that all productions are
binary, except those with a single terminal on the right-hand side and epsilon productions.
The only $\epsilon$-productions are those introduced by left factorization. Our factored grammar
was produced by factoring the trees in the training corpus before grammar induction,
which proceeded in the standard way, by counting rule frequencies.

13 Equivalently, we can split the analyses at this point, so that there is one POS per analysis.

5. Empirical results

The empirical results will be presented in three stages: (i) trials to examine the accuracy 
and efficiency of the parser; (ii) trials to examine its effect on test corpus perplexity and 
recognition performance; and (iii) trials to examine the effect of beam variation on these 
performance measures. Before presenting the results, we will introduce the methods of 
evaluation. 



5.1 Evaluation 

Perplexity is a standard measure within the speech recognition community for comparing 
language models. In principle, if two models are tested on the same test corpus, the 
model that assigns the lower perplexity to the test corpus is the model closest to the true 
distribution of the language, and thus better as a prior model for speech recognition. 
Perplexity is the exponential of the cross entropy, which we will define next. 

Given a random variable $X$ with distribution $p$ and a probability model $q$, the cross
entropy, $H(p, q)$, is defined as follows:

$$H(p, q) = -\sum_{x \in X} p(x) \log q(x) \tag{12}$$

Let $p$ be the true distribution of the language. Then, under certain assumptions (see footnote 14), given
a large enough sample, the sample mean of the negative log probability of a model will
converge to its cross entropy with the true model. That is

$$H(p, q) = -\lim_{n \rightarrow \infty} \frac{1}{n} \log q(w_0^n) \tag{13}$$

where $w_0^n$ is a string of the language $L$. In practice, one takes a large sample of the
language, and calculates the negative log probability of the sample, normalized by its
size (see footnote 15). The lower the cross entropy (i.e. the higher the probability the model assigns to
the sample), the better the model. Usually this is reported in terms of perplexity, which
we will do as well (see footnote 16).
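As a small worked example of the relationship between cross entropy and perplexity, the following sketch computes both from per-token log probabilities. The function names and the toy sample are assumptions made here; the input is taken to be natural-log probabilities for every token in the sample, including end markers.

```python
import math


def cross_entropy_bits(log_probs):
    """Negative mean log probability of the sample, in bits per token."""
    return -sum(log_probs) / (len(log_probs) * math.log(2.0))


def perplexity(log_probs):
    """Exponential of the per-token cross entropy (computed in nats)."""
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that assigns probability 1/4 to each of 8 tokens.
sample_log_probs = [math.log(0.25)] * 8
h_bits = cross_entropy_bits(sample_log_probs)
ppl = perplexity(sample_log_probs)
```

A model that is uniformly unsure among four alternatives has a cross entropy of 2 bits and a perplexity of 4, so perplexity can be read as an effective branching factor.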

Some of the trials discussed below will report results in terms of word and/or sentence
error rate, which are obtained when the language model is embedded in a speech
recognition system. Word error rate is the number of deletion, insertion, or substitution 
errors per 100 words. Sentence error rate is the number of sentences with one or more 
errors per 100 sentences. 

Statistical parsers are typically evaluated for accuracy at the constituent level, rather
than simply on whether or not the parse that the parser found is completely correct.
A constituent for evaluation purposes consists of a label (e.g. NP) and a span (beginning 
and ending word positions). For example, in figure ^(a), there is a VP that spans the 
words "chased the ball" . Evaluation is carried out on a hand-parsed test corpus, and 
the manual parses are treated as correct. We will call the manual parse GOLD and the 
parse that the parser returns TEST. Precision is the number of common constituents in 
GOLD and TEST divided by the number of constituents in TEST. Recall is the number 
of common constituents in GOLD and TEST divided by the number of constituents 
in GOLD. Following standard practice, we will be reporting scores only for non-part- 
of-speech constituents, which are called labeled recall (LR) and labeled precision (LP). 
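The precision and recall definitions above can be made concrete with a short sketch. The representation of a constituent as a (label, start, end) tuple and the function name are assumptions of this example, and part-of-speech constituents are assumed to have been removed beforehand.

```python
from collections import Counter


def labeled_precision_recall(gold, test):
    """Return (precision, recall) over labeled constituents (label, start, end)."""
    gold_counts, test_counts = Counter(gold), Counter(test)
    # Multiset intersection: a constituent matches only as often as it occurs in both.
    common = sum((gold_counts & test_counts).values())
    precision = common / len(test) if test else 0.0
    recall = common / len(gold) if gold else 0.0
    return precision, recall


gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
test = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5), ("NP", 4, 5)]

lp, lr = labeled_precision_recall(gold, test)
parse_error = 1.0 - (lp + lr) / 2.0
```

Three of the five TEST constituents are also in GOLD, giving a precision of 0.6, and three of the four GOLD constituents were found, giving a recall of 0.75.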



14 See Cover and Thomas (1991) for a discussion of the Shannon-McMillan-Breiman theorem.

15 It is important to remember to include the end marker in the strings of the sample.

16 When assessing the magnitude of a perplexity improvement, it is often better to look at the
reduction in cross entropy, by taking the log of the perplexity. It will be left to the reader to do so.

Conditioning   LR     LP     CB     0 CB   ≤ 2 CB   Pct. failed   Avg. rule expansions considered†   Average analyses advanced†

section 23: 2245 sentences of length ≤ 40
none           71.1   75.3   2.48   37.3   62.9     0.9           14,369                             516.5
par+sib        82.8   83.6   1.55   54.3   76.2     1.1           9,615                              324.4
NT struct      84.3   84.9   1.38   56.7   79.5     1.0           8,617                              284.9
NT head        85.6   85.7   1.27   59.2   81.3     0.9           7,600                              251.6
POS struct     86.1   86.2   1.23   60.9   82.0     1.0           7,327                              237.9
attach         86.7   86.6   1.17   61.7   83.2     1.2           6,834                              216.8
all            86.6   86.5   1.19   62.0   82.7     1.3           6,379                              198.4

section 23: 2416 sentences of length ≤ 100
attach         85.8   85.8   1.40   58.9   80.3     1.5           7,210                              227.9
all            85.7   85.7   1.41   59.0   79.9     1.7           6,709                              207.6

†per word

Table 2

Results conditioning on various contextual events, standard training and testing corpora



Sometimes in figures we will plot their average, and also what can be termed the parse 
error, which is one minus their average. 

LR and LP are part of the standard set of PARSEVAL measures of parser quality
(Black et al., 1991). From this set of measures, we will also include the crossing bracket
scores: average crossing brackets (CB), percentage of sentences with no crossing brackets
(0 CB), and the percentage of sentences with two crossing brackets or fewer (≤ 2 CB).
In addition, we show the average number of rule expansions considered per word, i.e.
the number of rule expansions for which a probability was calculated (see Roark and
Charniak (2000)), and the average number of analyses advanced to the next priority
queue per word.

This is an incremental parser with a pruning strategy and no backtracking. In such 
a model, it is possible to commit to a set of partial analyses at a particular point that 
cannot be completed given the rest of the input string (i.e. the parser can garden path). 
In such a case, the parser fails to return a complete parse. In the event that no complete 
parse is found, the highest initially ranked parse on the last non-empty priority queue 
is returned. All unattached words are then attached at the highest level in the tree. In 
such a way we predict no new constituents and all incomplete constituents are closed. 
This structure is evaluated for precision and recall, which is entirely appropriate for these 
incomplete as well as complete parses. If we fail to identify nodes later in the parse, the 
recall will suffer, and if our early predictions were bad, both precision and recall will 
suffer. Of course, the percentage of these failures is reported as well.

5.2 Parser accuracy and efficiency 

The first set of results looks at the performance of the parser on the standard corpora
for statistical parsing trials: sections 2-21 (989,860 words, 39,832 sentences) of the Penn
Treebank (Marcus, Santorini, and Marcinkiewicz, 1993) serving as the training data,
section 24 (34,199 words, 1,346 sentences) as the held-out data for parameter estimation,
and section 23 (59,100 words, 2,416 sentences) as the test data. Section 22 (41,817 words,
1,700 sentences) served as the development corpus, on which the parser was tested until
stable versions were ready to run on the test data, to avoid developing the parser to fit
the specific test data.

Table 2 shows trials with increasing amounts of conditioning information from the
left-context. There are a couple of things to notice from these results. First, and least
surprising, is that the accuracy of the parses improved as we conditioned on more and
more information. Like the non-lexicalized parser in Roark and Johnson (1999), we found
that the search efficiency, in terms of number of rule expansions considered or number
of analyses advanced, also improved as we increased the amount of conditioning. Unlike
the Roark and Johnson parser, however, our coverage did not substantially drop as the
amount of conditioning information increased, and in some cases improved slightly. They
did not smooth their conditional probability estimates, and blamed sparse data for their
decrease in coverage as they increased the conditioning information. These results appear
to support this, since our smoothed model showed no such tendency.

[Figure 5: plot of parse error and rule expansions against conditioning information, at levels
(0,0,0), (2,2,2), (5,2,2), (6,2,2), (6,3,2), (6,5,2), and (6,6,4)]

Figure 5

Reduction in average precision/recall error and in number of rule expansions per word as
conditioning increases, for sentences of length ≤ 40

Figure 5 shows the reduction in parser error, $1 - \frac{LR + LP}{2}$, and the reduction in rule
expansions considered as the conditioning information increased. The bulk of the improvement
comes from simply conditioning on the labels of the parent and the closest
sibling to the node being expanded. Interestingly, conditioning all POS expansions on two
c-commanding heads made no accuracy difference compared to conditioning only leftmost
POS expansions on a single c-commanding head; but it did improve the efficiency.

These results, achieved using very straightforward conditioning events and considering
only the left-context, are within 1-4 points of the best published accuracies cited
above (see footnote 17). Of the 2416 sentences in the section, 728 had the totally correct parse, 30.1
percent tree accuracy. Also, the parser returns a set of candidate parses, from which we
have been choosing the top ranked; if we use an oracle to choose the parse with the
highest accuracy from among the candidates (which averaged 70.0 in number per sentence),
we find an average labelled precision/recall of 94.1, for sentences of length ≤ 100.
The parser, thus, could be used as a front end to some other model, with the hopes of
selecting a more accurate parse from among the final candidates.

17 Our score of 85.8 average labelled precision and recall for sentences of length less than
or equal to 100 on section 23 compares to: 86.7 in Charniak (…), … in Ratnaparkhi (1997),
88.2 in Collins (1999), 89.6 in Charniak (2000), and 89.75 in Collins (2000).

[Figure 6: plot of observed running time]

Figure 6

Observed running time on section 23 of the Penn treebank, with the full conditional
probability model and beam of $10^{-11}$, using one 300 MHz UltraSPARC processor and 256MB
of RAM of a Sun Enterprise 450

While we have shown that the conditioning information improves the efficiency in
terms of rule expansions considered and analyses advanced, what does the efficiency of
such a parser look like in practice? Figure 6 shows the observed time at our standard
base beam of $10^{-11}$ with the full conditioning regimen, alongside an approximation of
the reported observed (linear) time in Ratnaparkhi (1997). Our observed times look
polynomial, which is to be expected given our pruning strategy: the denser the competitors
within a narrow probability range of the best analysis, the more time will be spent
working on these competitors; and the farther along in the sentence, the more chance for
ambiguities that can lead to such a situation. While our observed times are not linear, and
are clearly slower than his times (even with a faster machine), they are quite respectably
fast. The differences between a k-best and a beam-search parser (not to mention the use
of dynamic programming) make a running time difference unsurprising. What is perhaps
surprising is that the difference is not greater. Furthermore, this is quite a large beam
(see discussion below), so that very large improvements in efficiency can be had at the
expense of the number of analyses that are retained.

5.3 Perplexity results

The next set of results will highlight what recommends this approach most: the ease 
with which one can estimate string probabilities in a single pass from left-to-right across 
the string. By definition, a PCFG's estimate of a string's probability is the sum of the 
probabilities of all trees that produce the string as terminal leaves (see equation Q). In 
the beam-search approach outlined above, we can estimate the string's probability in 
the same manner, by summing the probabilities of the parses that the algorithm finds. 
Since this is not an exhaustive search, the parses that are returned will be a subset of 
the total set of trees that would be used in the exact PCFG estimate; hence the estimate 
thus arrived at will be bounded above by the probability that would be generated from 
an exhaustive search. The hope is that a large amount of the probability mass will be 
accounted for by the parses in the beam. The method cannot overestimate the probability 
of the string. 

Recall the discussion of the grammar models above, and our definition of the set of
partial derivations $D_j$ with respect to a prefix string $w_0^j$ (see the equations given there). By
definition,

$$P(w_{j+1} \mid w_0^j) = \frac{P(w_0^{j+1})}{P(w_0^j)} = \frac{\sum_{d \in D_{j+1}} P(d)}{\sum_{d \in D_j} P(d)} \tag{14}$$

Note that the numerator at word $w_j$ is the denominator at word $w_{j+1}$, so that the product
of all of the word probabilities is the numerator at the final word, i.e. the string prefix
probability.

We can make a consistent estimate of the string probability by similarly summing
over all of the trees within our beam. Let $H_i^{init}$ be the priority queue $H_i$ before any
processing has begun with word $w_i$ in the look-ahead. This is a subset of the possible
leftmost partial derivations with respect to the prefix string $w_0^{i-1}$. Since $H_{i+1}^{init}$ is produced
by expanding only analyses on priority queue $H_i^{init}$, the set of complete trees consistent
with the partial derivations on priority queue $H_{i+1}^{init}$ is a subset of the set of complete trees
consistent with the partial derivations on priority queue $H_i^{init}$, i.e. the total probability
mass represented by the priority queues is monotonically decreasing. Thus conditional
word probabilities defined in a way consistent with equation 14 will always be between
zero and one. Our conditional word probabilities are calculated as follows:

$$P(w_i \mid w_0^{i-1}) = \frac{\sum_{d \in H_{i+1}^{init}} P(d)}{\sum_{d \in H_i^{init}} P(d)} \tag{15}$$
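The ratio in equation 15 and the way the word probabilities multiply out can be illustrated numerically. The list of beam masses below is invented for the example, and the function name is hypothetical; each entry stands for the summed probability of the analyses on one initial priority queue.

```python
import math


def conditional_word_probs(beam_mass):
    """beam_mass[i] is the total probability on the i-th initial queue; return the ratios."""
    return [beam_mass[i + 1] / beam_mass[i] for i in range(len(beam_mass) - 1)]


# Invented masses for a three-word string: they can only shrink from queue to queue.
beam_mass = [1.0, 0.1, 0.02, 0.015]
probs = conditional_word_probs(beam_mass)
string_prob = math.prod(probs)
```

Because each numerator is the next denominator, the product of the conditional probabilities collapses to the final mass divided by the initial one, and every ratio lies between zero and one as long as the masses are non-increasing. A zero mass would make the ratio undefined, which is the case the interpolation with a unigram estimate is meant to cover.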

As mentioned above, the model cannot overestimate the probability of a string, 
because the string probability is simply the sum over the beam, which is a subset of the 
possible derivations. By utilizing a figure-of-merit to identify promising analyses, we are 
simply placing our attention on those parses which are likely to have a high probability, 
and thus we are increasing the amount of probability mass that we do capture, of the 
total possible. It is not part of the probability model itself. 

Since each word is (almost certainly, because of our pruning strategy) losing some
probability mass, the probability model is not "proper", i.e. the sum of the probabilities
over the vocabulary is less than one. In order to have a proper probability distribution,
we would need to renormalize by dividing by some factor. Note, however, that this
renormalization factor is necessarily less than one, and thus would uniformly increase each
word's probability under the model, i.e. any perplexity results reported below will be
higher than the "true" perplexity that would be assigned with a properly normalized
distribution. In other words, renormalizing would make our perplexity measure lower
still. The hope, however, is that the improved parsing model provided by our conditional
probability model will cause the distribution over structures to be more peaked, thus
enabling us to capture more of the total probability mass, and making this a fairly snug
upper bound on the perplexity.

Corpus       Conditioning   LR     LP     Pct. failed   Perplexity   Avg. rule expansions considered†   Avg. analyses advanced†

sections 23-24: 3761 sentences ≤ 120
unmodified   all            85.2   85.1   1.7                        7,206                              213.5
no punct     all            82.4   82.9   0.2                        9,717                              251.8
C&J corpus   par+sib        75.2   77.4   0.1           310.04       17,418                             457.2
C&J corpus   NT struct      77.3   79.2   0.1           290.29       15,948                             408.8
C&J corpus   NT head        79.2   80.4   0.1           255.85       14,239                             363.2
C&J corpus   POS struct     80.5   81.6   0.1           240.37       13,591                             341.3
C&J corpus   all            81.7   82.1   0.2           152.26       11,667                             279.7

†per word

Table 3

Results conditioning on various contextual events, sections 23-24, modifications following
Chelba and Jelinek

One final note on assigning probabilities to strings: because this parser does garden
path on a small percentage of sentences, this must be interpolated with another estimate,
to ensure that every word receives a probability estimate. In our trials, we used the
unigram, with a very small mixing coefficient:

$$P(w_i \mid w_0^{i-1}) = \lambda(w_0^{i-1}) \frac{\sum_{d \in H_{i+1}^{init}} P(d)}{\sum_{d \in H_i^{init}} P(d)} + (1 - \lambda(w_0^{i-1}))\, P(w_i) \tag{16}$$

If $\sum_{d \in H_i^{init}} P(d) = 0$ in our model, then our model provides no distribution over following
words, since the denominator is zero. Thus,

$$\lambda(w_0^{i-1}) = \begin{cases} 0 & \text{if } \sum_{d \in H_i^{init}} P(d) = 0 \\ .999 & \text{otherwise} \end{cases} \tag{17}$$
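The interpolation with the unigram can be sketched in a few lines. The function and argument names are hypothetical and the numbers in the example are invented; only the mixing weight of .999 and the fallback to the unigram when the beam carries no mass come from the text.

```python
def interpolated_word_prob(mass_next, mass_current, unigram_prob, parser_weight=0.999):
    """Mix the beam-based estimate with a unigram; use the unigram alone if the beam is empty."""
    if mass_current == 0.0:
        # The parser has garden-pathed: it provides no distribution over following words.
        return unigram_prob
    parser_estimate = mass_next / mass_current
    return parser_weight * parser_estimate + (1.0 - parser_weight) * unigram_prob


normal_case = interpolated_word_prob(0.02, 0.1, unigram_prob=0.001)
failed_case = interpolated_word_prob(0.0, 0.0, unigram_prob=0.001)
```

In the ordinary case the unigram contributes almost nothing, so the parser's estimate dominates. Its role is to guarantee a non-zero probability for every word after a failure.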



Chelba and Jelinek (1998a; 1998b) also used a parser to help assign word probabilities,
via the Structured Language Model outlined in section 3.2. They trained and tested
the SLM on a modified, more "speech-like" version of the Penn Treebank. Their modifications
included: (i) removing orthographic cues to structure (e.g. punctuation); (ii)
replacing all numbers with the single token N; and (iii) closing the vocabulary at 10,000,
replacing all other words with the UNK token. They used sections 00-20 (929,564 words)
as the development set, sections 21-22 (73,760 words) as the check set (for interpolation
coefficient estimation), and tested on sections 23-24 (82,430 words). We obtained the
training and testing corpora from them (which we will denote C&J corpus), and also
created intermediate corpora, upon which only the first two modifications were carried
out (which we will denote no punct). Differences in performance will give an indication
of the impact on parser performance of the different modifications to the corpora. All
trials in this section used sections 00-20 for counts, held out 21-22, and tested on 23-24.

Table 3 shows several things. First, it shows relative performance for unmodified, no
punct, and C&J corpora with the full set of conditioning information. We can see that
removing the punctuation causes (unsurprisingly) a dramatic drop in the accuracy and
efficiency of the parser. Interestingly, it also causes coverage to become nearly total, with
failure on just two sentences per thousand on average.

Figure 7
Reduction in average precision/recall error, number of rule expansions, and perplexity as
conditioning increases

We see the familiar pattern, in the C&J corpus results, of improving performance as 
the amount of conditioning information grows. In this case we have perplexity results as 
well, and figure 7 shows the reduction in parser error, rule expansions, and perplexity
as the amount of conditioning information grows. While all three seem to be similarly 
improved by the addition of structural context (e.g. parents and siblings), the addition 
of c-commanding heads has only a moderate effect on the parser accuracy, but a very 
large effect on the perplexity. The fact that the efficiency was improved more than the 
accuracy in this case (as was also seen in figure ||) , seems to indicate that this additional 
information is causing the distribution to become more peaked, so that fewer analyses 
are making it into the beam. 

Table 4 compares the perplexity of our model with Chelba and Jelinek (1998a; 1998b)
on the same training and testing corpora. We built an interpolated trigram model to
serve as a baseline (as they did), and also interpolated our model's perplexity with the
trigram, using the same mixing coefficient as they did in their trials (taking 36 percent
of the estimate from the trigram).[18] The trigram model was also trained on sections
00-20 of the C&J corpus. Trigrams and bigrams were binned by the total count of the
conditioning words in the training corpus, and maximum likelihood mixing coefficients
were calculated for each bin, to mix the trigram with bigram and unigram estimates.
Our trigram model performs at almost exactly the same level that theirs does, which is
what we would expect. Our parsing model's perplexity improves upon their first result
fairly substantially,[19] but is only slightly better than their second result. However, when



18 Our optimal mixture level was closer to 40 percent, but the difference was negligible. 

19 Recall that our perplexity measure should, ideally, be even lower still.

Paper                          Perplexity
                               Trigram Baseline   Model    Interpolation, λ=.36
Chelba and Jelinek (1998a)     167.14             158.28   148.90
Chelba and Jelinek (1998b)     167.14             153.76   147.70
Current results                167.02             152.26   137.26

Table 4
Comparison with previous perplexity results



we interpolate with the trigram, we see that the additional improvement is greater than 
the one they experienced. This is not surprising, since our conditioning information is 
in many ways orthogonal to that of the trigram, insofar as it includes the probability 
mass of the derivations; in contrast, their model in some instances is very close to the 
trigram, by conditioning on two words in the prefix string, which may happen to be the 
two adjacent words. 
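As a rough illustration of why interpolation helps most when two models capture different information, the following Python sketch mixes per-word probabilities with 36 percent of the estimate taken from the trigram and computes perplexity. The function names and the toy probabilities are assumptions made for illustration; they are not values from the experiments reported here.

```python
import math


def interpolate(p_model: float, p_trigram: float, trigram_weight: float = 0.36) -> float:
    """Linear mixture; trigram_weight is the share taken from the trigram."""
    return trigram_weight * p_trigram + (1.0 - trigram_weight) * p_model


def perplexity(word_probs) -> float:
    """Exponential of the negative mean log probability of the words."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))


# Hypothetical two-word test set: each model is confident on a different word.
model_probs = [0.5, 0.01]
trigram_probs = [0.01, 0.5]
mixed_probs = [interpolate(m, t) for m, t in zip(model_probs, trigram_probs)]

print(perplexity(model_probs), perplexity(trigram_probs), perplexity(mixed_probs))
```

In this toy case the mixture has a much lower perplexity than either component, because each model rescues the word on which the other does badly. Two models that make the same mistakes would gain far less from mixing.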

These results are particularly remarkable, given that we did not build our model as 
a language model per se, but rather as a parsing model. The perplexity improvement 
was achieved by simply taking the existing parsing model and applying it, with no extra 
training beyond that done for parsing. 

The hope was expressed above that our reported perplexity would be fairly close to 
the "true" perplexity that we would achieve if the model were properly normalized, i.e. 
that the amount of probability mass that we lose by pruning is small. One way to test this 
is the following:[20] at each point in the sentence, calculate the conditional probability of
each word in the vocabulary given the previous words, and sum them. If there is little loss 
of probability mass, the sum should be close to one. We did this for the first 10 sentences 
in the test corpus, a total of 213 words (including the end-of-sentence markers). One of 
the sentences was a failure, so that 12 of the word probabilities (all of the words after 
the point of the failure) were not estimated by our model. Of the remaining 201 words, 
the average sum of the probabilities over the 10,000 word vocabulary was 0.9821, with 
a minimum of 0.7960, and a maximum of 0.9997. Interestingly, at the word where the 
failure occurred, the sum of the probabilities was 0.9301. 
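The check just described amounts to summing a conditional distribution over the whole vocabulary at each position. A minimal Python sketch follows, under the assumption that the model is exposed as a function from a word and a history to a probability; `toy_cond_prob` is a hypothetical stand-in for the parser-based estimate, with a three-word vocabulary instead of 10,000 words.

```python
def probability_mass(cond_prob, vocabulary, history) -> float:
    """Sum the conditional probabilities of every vocabulary item given the history."""
    return sum(cond_prob(word, history) for word in vocabulary)


def toy_cond_prob(word, history):
    """Hypothetical pruned model: 2 percent of the mass has been lost."""
    table = {"the": 0.5, "a": 0.3, "</s>": 0.18}
    return table.get(word, 0.0)


vocabulary = ["the", "a", "</s>"]
mass = probability_mass(toy_cond_prob, vocabulary, history=())
print(mass)  # close to one when little mass is lost to pruning
```

A sum well below one at some position would indicate that pruning removed derivations carrying a substantial share of the probability.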



5.4 Word error rate 

In order to get a sense of whether these perplexity reduction results can translate to
improvement in a speech recognition task, we performed a very small preliminary
experiment on N-best lists. The DARPA '93 HUB1 test setup consists of 213 utterances
read from the Wall St. Journal, a total of 3446 words. The corpus comes with a baseline
trigram model, using a 20,000 word open vocabulary, and trained on approximately 40
million words. We used Ciprian Chelba's A* decoder[21] to find the 50 best hypotheses
from each lattice, along with the acoustic and trigram scores. Given the idealized
circumstances of the production (text read in a lab), the lattices are relatively sparse, and
in many cases 50 distinct string hypotheses were not found in a lattice. We reranked an
average of 22.9 hypotheses with our language model per utterance.

One complicating issue has to do with the tokenization in the Penn Treebank versus 
that in the HUB1 lattices. In particular, contractions (e.g. he's) are split in the Penn 



20 Thanks to Ciprian Chelba for this suggestion.

21 See Chelba (2000) for details.

Model                  Training  Vocabulary  LM      Word Error  Sentence Error
                       Size      Size        Weight  Rate %      Rate %
Lattice trigram        40M       20K         16      13.7        69.0
Chelba (2000) (λ=.4)   20M       20K         16      13.0        -
Current model          1M        10K         15      15.1        73.2
Treebank trigram       1M        10K         5       16.5        79.8
No language model      -         -           -       16.8        84.0

Table 5
Word and sentence error rate results for various models, with differing training and vocabulary
sizes, for the best language model factor for that particular model



Treebank (he 's) but not in the HUB1 lattices. Splitting of the contractions is critical 
for parsing, since the two parts oftentimes (as in the previous example) fall in different 
constituents. We follow Chelba (2000) in dealing with this problem: for parsing purposes,
we use the Penn Treebank tokenization; for interpolation with the provided trigram 
model, and for evaluation, the lattice tokenization is used. If we are to interpolate our 
model with the lattice trigram, we must wait until we have our model's estimate for the 
probability of both parts of the contraction; their product can then be interpolated with 
the trigram estimate. In fact, interpolation in these trials made no improvement over 
the better of the uninterpolated models, but simply resulted in performance somewhere 
between the better and the worse of the two models, so we will not present interpolated 
trials here. 
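The treatment of contractions during interpolation can be sketched as follows. This is an illustrative Python sketch only: the function names are hypothetical, and `parser_weight` is an assumed parameter, since no mixing coefficient is reported for these trials.

```python
import math


def contraction_probability(part_probs) -> float:
    """Probability of a lattice token (e.g. he's) as the product of the
    parser's estimates for its Treebank parts (e.g. he and 's)."""
    return math.prod(part_probs)


def interpolate_contraction(part_probs, trigram_prob: float, parser_weight: float) -> float:
    """Mix the product of the part estimates with the trigram's estimate
    for the unsplit lattice token."""
    parser_prob = contraction_probability(part_probs)
    return parser_weight * parser_prob + (1.0 - parser_weight) * trigram_prob
```

The point of the design is that the mixture is formed once per lattice token, after the parser has scored both parts, so that the two models are always compared on the same unit.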

Table 5 reports the word and sentence error rates for five different models: (i) the
trigram model that comes with the lattices, trained on approximately 40M words, with
a vocabulary of 20,000; (ii) the best performing model from Chelba (2000), which was
interpolated with the lattice trigram at λ=0.4; (iii) our parsing model, with the same
training and vocabulary as the perplexity trials above; (iv) a trigram model with the same
training and vocabulary as the parsing model; and (v) no language model at all. This
last model shows the performance from the acoustic model alone, without the influence
of the language model. The log of the language model score is multiplied by the language
model (LM) weight when summing the logs of the language and acoustic scores, as a
way of increasing the relative contribution of the language model to the composite score.
We followed Chelba (2000) in using an LM weight of 16 for the lattice trigram. For our
model and the treebank trigram model, the LM weight that resulted in the lowest error
rates is given.
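The role of the LM weight in rescoring an N-best list can be illustrated with a short Python sketch. The `Hypothesis` type, the function names, and the scores in the usage example are hypothetical; only the scoring rule (acoustic log score plus LM weight times language model log score) follows the description above.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    words: str
    acoustic_logprob: float
    lm_logprob: float


def composite_score(hyp: Hypothesis, lm_weight: float) -> float:
    """Acoustic log score plus the weighted language model log score."""
    return hyp.acoustic_logprob + lm_weight * hyp.lm_logprob


def rerank(nbest: List[Hypothesis], lm_weight: float) -> List[Hypothesis]:
    """Sort hypotheses from best to worst composite score."""
    return sorted(nbest, key=lambda h: composite_score(h, lm_weight), reverse=True)


# Hypothetical two-item N-best list: "a" wins acoustically, "b" wins on the LM.
nbest = [Hypothesis("a", -100.0, -12.0), Hypothesis("b", -102.0, -11.0)]
print(rerank(nbest, lm_weight=0.0)[0].words)   # acoustic score alone
print(rerank(nbest, lm_weight=15.0)[0].words)  # language model weighted in
```

Setting the weight to zero corresponds to the "no language model" condition, in which the ranking is determined by the acoustic score alone.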

The small size of our training data, as well as the fact that we are rescoring N-best
lists, rather than working directly on lattices, makes comparison with the other models
not particularly informative. What is more informative is the difference between our
model and the trigram trained on the same amount of data. We achieved an 8.5 percent
relative improvement in word error rate, and an 8.3 percent relative improvement in
sentence error rate over the treebank trigram. Interestingly, as mentioned above,
interpolating two models together gave no improvement over the better of the two, whether our
model was interpolated with the lattice or the treebank trigram. This contrasts with our
perplexity results reported above, as well as with the recognition experiments in Chelba
(2000), where the best results resulted from interpolated models.
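The relative improvements quoted above follow directly from the error rates in Table 5 for the treebank trigram (16.5 and 79.8) and the current model (15.1 and 73.2). A small Python check, with a hypothetical helper name:

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Fraction of the baseline error rate that was removed."""
    return (baseline - new) / baseline


word_gain = relative_improvement(16.5, 15.1)      # word error rate
sentence_gain = relative_improvement(79.8, 73.2)  # sentence error rate
print(f"{100 * word_gain:.1f} percent, {100 * sentence_gain:.1f} percent")
```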



The point of this small experiment was to see if our parsing model could provide
useful information even in the case that recognition errors occur, as opposed to the
(generally) fully grammatical strings upon which the perplexity results were obtained.
As one reviewer pointed out, given that our model relies so heavily on context, it may
have difficulty recovering from even one recognition error, perhaps more difficulty than a
more locally-oriented trigram. While the improvements over the trigram model in these
trials are modest, they do indicate that our model is robust enough to provide good
information even in the face of noisy input. Future work will include more substantial
word recognition experiments.

5.5 Beam variation

The last set of results that we will present addresses the question of how wide the beam
must be for adequate results. The base beam factor that we have used to this point is
10^-11, which is quite wide. It was selected with the goal of high parser accuracy; but in
this new domain, parser accuracy is a secondary measure of performance. To determine
the effect on perplexity, we varied the base beam factor in trials on the Chelba and Jelinek
corpora, keeping the level of conditioning information constant, and table 6 shows the
results across a variety of factors.

Base Beam                Pct.    Perplexity  Perplexity  Avg. rule expansions  Words per
Factor     LR     LP     failed  λ=0         λ=.36       considered†           second
sections 23-24: 3761 sentences < 120
10^-11     81.7   82.1   0.2     152.26      137.26      11,667                3.1
10^-10     81.5   81.9   0.3     154.25      137.88      6,982                 5.2
10^-9      80.9   81.3   0.4     156.83      138.69      4,154                 8.9
10^-8      80.2   80.6   0.6     160.63      139.80      2,372                 15.3
10^-7      78.8   79.2   1.2     166.91      141.30      1,468                 25.5
10^-6      77.4   77.9   1.5     174.44      143.05      871                   43.8
10^-5      75.8   76.3   2.6     187.11      145.76      517                   71.6
10^-4      72.9   73.9   4.5     210.28      148.41      306                   115.5
10^-3      68.4   70.6   8.0     253.77      152.33      182                   179.6

†per word

Table 6
Results with full conditioning on the C&J corpus at various base beam factors

The parser error, parser coverage, and the uninterpolated model perplexity (λ = 0) all
suffered substantially from a narrower search, but the interpolated perplexity remained
quite good even at the extremes. Figure 8 plots the percentage increase in parser error,
model perplexity, interpolated perplexity, and efficiency (i.e. decrease in rule expansions
per word) as the base beam factor decreased. Note that the model perplexity and parser
accuracy are quite similarly affected, but that the interpolated perplexity remained far
below the trigram baseline, even with extremely narrow beams.
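The size of these effects can be read off Table 6 directly. The Python sketch below computes percentage changes for three of the reported beam settings; measuring the change relative to the widest beam (10^-11) is an assumption made here for illustration, and the names are hypothetical.

```python
# Base beam factor exponent -> (model perplexity, interpolated perplexity,
# rule expansions per word), copied from Table 6.
TABLE_6 = {
    -11: (152.26, 137.26, 11667),
    -7: (166.91, 141.30, 1468),
    -3: (253.77, 152.33, 182),
}
TRIGRAM_BASELINE = 167.02  # trigram perplexity from Table 4


def percent_change(reference: float, value: float) -> float:
    """Signed percentage change of value relative to reference."""
    return 100.0 * (value - reference) / reference


def beam_effects(exponent: int) -> dict:
    """Percentage changes relative to the widest beam (assumed reference)."""
    ref_ppl, ref_mixed, ref_expansions = TABLE_6[-11]
    ppl, mixed, expansions = TABLE_6[exponent]
    return {
        "model_perplexity": percent_change(ref_ppl, ppl),
        "interpolated_perplexity": percent_change(ref_mixed, mixed),
        "rule_expansions": percent_change(ref_expansions, expansions),
    }


narrowest = beam_effects(-3)
print(narrowest)
```

At the narrowest beam the uninterpolated perplexity rises by roughly two thirds, while the interpolated perplexity rises by about 11 percent and is still below the trigram baseline of 167.02, with more than 98 percent fewer rule expansions per word.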

Figure 8
Increase in average precision/recall error, model perplexity, interpolated perplexity, and
efficiency (i.e. decrease in rule expansions per word) as base beam factor decreases

6. Conclusion and Future Directions

The empirical results presented above are quite encouraging, and the potential of this
kind of approach both for parsing and language modeling seems very promising. With
a simple conditional probability model, and simple statistical search heuristics, we were
able to find very accurate parses efficiently, and, as a side effect, were able to assign word
probabilities that yield a perplexity improvement over previous results. These perplexity
improvements are particularly promising, because the parser is providing information
that is, in some sense, orthogonal to the information provided by a trigram model, as
evidenced by the robust improvements to the baseline trigram when the two models are
interpolated.

There are several important future directions that will be taken in this area. First, 
there is reason to believe that some of the conditioning information is not uniformly 
useful, and we would benefit from finer distinctions. For example, the probability of a 
preposition is presumably more dependent on a c-commanding head than the probability 
of a determiner is. Yet in the current model they are both conditioned on that head, as 
leftmost constituents of their respective phrases. Second, there are advantages to top- 
down parsing that have not been examined to date, e.g. empty categories. A top-down 
parser, in contrast to a standard bottom-up chart parser, has enough information to 
predict empty categories only where they are likely to occur. By including these nodes 
(which are in the original annotation of the Penn Treebank), we may be able to bring 
certain long distance dependencies into a local focus. In addition, as mentioned above, 
we would like to further test our language model in speech recognition tasks, to see if 
the perplexity improvement that we have seen can lead to significant reductions in word 
error rate. 

Other parsing approaches might also be used in the way that we have used a top- 
down parser. Earley and left-corner parsers, as mentioned in the introduction, also have 
rooted derivations that can be used to calculate generative string prefix probabilities
incrementally. In fact, left-corner parsing can be simulated by a top-down parser by trans- 
forming the grammar, as was done in Roark and Johnson (1999), and so an approach 
very similar to the one outlined here could be used in that case. Perhaps some compro- 
mise between the fully connected structures and extreme underspecification will yield an 
efficiency improvement. Also, the advantages of head-driven parsers may outweigh their 
inability to interpolate with a trigram, and lead to better off-line language models than 
those that we have presented here. 

Does a parsing model capture exactly what we need for informed language modeling?
The answer to that is no. Some information is simply not structural in nature (e.g. topic),
and we might expect other kinds of models to be able to better handle non-structural
dependencies. The improvement that we derived from interpolating the different models
above indicates that using multiple models may be the most fruitful path in the future.
In any case, a parsing model of the sort that we have presented here should be viewed as
an important potential source of key information for speech recognition. Future research
will show if this early promise can be fully realized.